Linear Programming for Finite State Multi-Armed Bandit Problems

Authors

  • Yih Ren Chen
  • Michael N. Katehakis
Abstract

1. Introduction. An important sequential control problem with a tractable solution is the multi-armed bandit problem. It can be stated as follows. There are N independent projects, e.g., statistical populations (see Robbins 1952), gambling machines (or bandits), etc. The state of the νth of them at time t is denoted by x_ν(t) and it belongs to a set of possible states S_ν, which in this paper is assumed to be finite. Let S_ν = {1, ..., A_ν}. At each point in time one can work on one project only, and if the νth of them is selected, one receives a reward r(t) = r^ν(x_ν(t)) and its state changes according to a stationary transition rule: p^ν_{ij} = P(x_ν(t+1) = j | x_ν(t) = i), while the states of all other projects remain unchanged: x_κ(t+1) = x_κ(t) if κ ≠ ν. Let x(t) = (x_1(t), ..., x_N(t)) and let π(t) denote the project selected at time t. The states of all projects are observable and the problem is to choose π(t) as a function of x(t), so as to maximize the expected total discounted reward, given an initial state x(0):
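The criterion is cut off at the colon in this excerpt. As a hedged completion (the discount-factor symbol α is an assumption, since the truncation hides the authors' exact notation), the standard expected total discounted reward objective for this model is

    V_π(x(0)) = E_π[ Σ_{t=0}^{∞} α^t r(t) | x(0) ],   with 0 < α < 1.

The title announces a linear-programming treatment of this problem. The sketch below is illustrative only and is not the authors' construction: it computes a Gittins-type index for one finite-state arm by solving the standard linear program of a discounted "continue or restart in state s" decision problem with scipy.optimize.linprog; the function name gittins_index_lp and the two-state example data are hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def gittins_index_lp(r, P, alpha, s):
        """Gittins-type index of state s for a single arm with reward vector r
        (length n), transition matrix P (n x n), and discount factor 0 < alpha < 1.

        Solves the standard LP of the 'restart-in-s' discounted decision problem:
            minimize   sum_i v_i
            subject to v_i >= r_i + alpha * P[i, :] @ v   (continue in state i)
                       v_i >= r_s + alpha * P[s, :] @ v   (restart in state s)
        and returns (1 - alpha) * v_s.
        """
        n = len(r)
        r = np.asarray(r, dtype=float)
        I = np.eye(n)
        # 'Continue' constraints rewritten into linprog's A_ub @ v <= b_ub form.
        A_cont = -(I - alpha * P)
        b_cont = -r
        # 'Restart' constraints: every row uses the transition row and reward of state s.
        A_rest = -(I - alpha * np.tile(P[s], (n, 1)))
        b_rest = -np.full(n, r[s])
        res = linprog(c=np.ones(n),
                      A_ub=np.vstack([A_cont, A_rest]),
                      b_ub=np.concatenate([b_cont, b_rest]),
                      bounds=[(None, None)] * n,
                      method="highs")
        return (1.0 - alpha) * res.x[s]

    # Illustrative two-state arm (hypothetical numbers).
    r = np.array([1.0, 0.0])
    P = np.array([[0.5, 0.5],
                  [0.1, 0.9]])
    print(gittins_index_lp(r, P, alpha=0.9, s=0))  # index of the first state

For the full multi-armed problem, the classical Gittins result is that always engaging the project whose current state has the largest such index maximizes the discounted criterion; a linear program of this kind is one standard way to compute those indices when the state spaces are finite.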


Similar Articles

Tax Problems in the Undiscounted Case

The aim of this paper is to evaluate the performance of the optimal policy (the Gittins index policy) for open tax problems of the type considered by Klimov in the undiscounted limit. In this limit, the state-dependent part of the cost is linear in the state occupation numbers for the multi-armed bandit, but is quadratic for the tax problem. The discussion of the passage to the limit for the tax...


Large-Scale Bandit Problems and KWIK Learning

We show that parametric multi-armed bandit (MAB) problems with large state and action spaces can be algorithmically reduced to the supervised learning model known as “Knows What It Knows” or KWIK learning. We give matching impossibility results showing that the KWIK-learnability requirement cannot be replaced by weaker supervised learning assumptions. We provide such results in both the standard...


Four proofs of Gittins' multiarmed bandit theorem

We study four proofs that the Gittins index priority rule is optimal for alternative bandit processes. These include Gittins’ original exchange argument, Weber’s prevailing charge argument, Whittle’s Lagrangian dual approach, and Bertsimas and Niño-Mora’s proof based on the achievable region approach and generalized conservation laws. We extend the achievable region proof to infinite countable ...


Complexity Constraints in Two-Armed Bandit Problems: An Example

This paper derives the optimal strategy for a two-armed bandit problem under the constraint that the strategy must be implemented by a finite automaton with an exogenously given, small number of states. The idea is to find learning rules for bandit problems that are optimal subject to the constraint that they must be simple. Our main results show that the optimal rule involves an arbitrary init...


An Optimal Algorithm for Linear Bandits

We provide the first algorithm for online bandit linear optimization whose regret after T rounds is of order √(Td ln N) on any finite class X ⊆ R^d of N actions, and of order d√T (up to log factors) when X is infinite. These bounds are not improvable in general. The basic idea utilizes tools from convex geometry to construct what is essentially an optimal exploration basis. We also present an app...



Journal:
  • Math. Oper. Res.

Volume: 11  Issue: —

Pages: —

Publication year: 1986